Elasticsearch IK Chinese Analysis Plugin

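Introduction

Elasticsearch's default standard analyzer splits Chinese text into individual characters. For example, it splits “中华人民共和国国歌” into “中”, “华”, “人”, “民”, “共”, “和”, “国”, “国”, “歌”. Tokenizing Chinese this way discards the meaning of the words and gives poor search results, so a dedicated Chinese analysis plugin needs to be installed.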

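IK Analysis Plugin

The IK Analysis plugin is an analyzer built specifically for Elasticsearch that handles Chinese text well. It also supports custom segmentation rules in the form of user dictionaries.

Install the plugin: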

./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.8/elasticsearch-analysis-ik-5.6.8.zip

Remove the plugin:

./elasticsearch-plugin remove analysis-ik

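Note: installing this way is supported only from Elasticsearch 5.5.1 onward. The plugin version should match the Elasticsearch version to avoid installation conflicts. After installing, restart the Elasticsearch service so that the plugin takes effect. In a multi-node cluster, the IK Analysis plugin must be installed on every node.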
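
To confirm that every node carries the plugin at the expected version, one option is to compare the rows returned by the `_cat/plugins` API against the list of node names. The sketch below works on rows that have already been fetched. The helper name `nodes_missing_ik`, the node names, and the sample rows are illustrative assumptions, shaped like the response to `GET _cat/plugins?format=json`.

```python
def nodes_missing_ik(plugin_rows, node_names, expected_version):
    """Return node names that lack analysis-ik or run a different version of it."""
    installed = {
        row["name"]: row["version"]
        for row in plugin_rows
        if row["component"] == "analysis-ik"
    }
    return sorted(n for n in node_names if installed.get(n) != expected_version)


# Hypothetical sample data; real rows would come from _cat/plugins?format=json.
rows = [
    {"name": "node-1", "component": "analysis-ik", "version": "5.6.8"},
    {"name": "node-1", "component": "x-pack", "version": "5.6.8"},
    {"name": "node-2", "component": "x-pack", "version": "5.6.8"},
]
print(nodes_missing_ik(rows, ["node-1", "node-2"], "5.6.8"))  # prints ['node-2']
```

An empty result would indicate that all listed nodes report the plugin at the expected version.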

Analyzers

IK ships with two analyzers: ik_max_word and ik_smart.

ik_max_word

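Splits the text at the finest granularity, exhausting the possible combinations. For example, “中华人民共和国国歌” is split into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”. The exact tokens depend on the plugin version and its dictionaries; the 5.6.8 output in the comparison below differs slightly from this list.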

ik_smart

Splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国, 国歌”.
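
A common arrangement, though not one this article prescribes, is to index a field with ik_max_word and analyze queries with ik_smart. The sketch below builds such an index body for Elasticsearch 5.x, where mappings are nested under a type name. The helper `ik_index_body`, the type name `doc`, and the field name `content` are hypothetical; `analyzer` and `search_analyzer` are standard mapping parameters.

```python
import json


def ik_index_body(field, doc_type="doc"):
    """Build an index-creation body whose text field uses the two IK analyzers."""
    return {
        "mappings": {
            doc_type: {  # ES 5.x nests properties under a mapping type
                "properties": {
                    field: {
                        "type": "text",
                        "analyzer": "ik_max_word",      # applied at index time
                        "search_analyzer": "ik_smart",  # applied to query text
                    }
                }
            }
        }
    }


print(json.dumps(ik_index_body("content"), indent=2))
```

The resulting JSON could be sent as the request body when creating the index.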

Comparison

The following compares the output of each analyzer.

Default analyzer

curl -u elastic:changeme -H 'Content-Type: application/json' -X GET -d '{"text":"中华人民共和国国歌"}' 'http://192.168.1.111:9200/_analyze?pretty'
{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "华",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "民",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "共",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "和",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "国",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "歌",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    }
  ]
}

ik_max_word

curl -u elastic:changeme -H 'Content-Type: application/json' -X GET -d '{"analyzer":"ik_max_word", "text":"中华人民共和国国歌"}' 'http://192.168.1.111:9200/_analyze?pretty'
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "中华人民",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "中华",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "华人",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "人民共和国",
      "start_offset" : 2,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "人民",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "共和国",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "共和",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "国",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 8
    },
    {
      "token" : "国歌",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 9
    }
  ]
}

ik_smart

curl -u elastic:changeme -H 'Content-Type: application/json' -X GET -d '{"analyzer":"ik_smart", "text":"中华人民共和国国歌"}' 'http://192.168.1.111:9200/_analyze?pretty'
{
  "tokens" : [
    {
      "token" : "中华人民共和国",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "国歌",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}
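
The three calls above differ only in the `analyzer` field, so the comparison can be scripted. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and `analyze` assumes a reachable cluster with HTTP basic authentication like the one used in the curl commands.

```python
import base64
import json
import urllib.request


def build_analyze_request(base_url, text, analyzer=None, auth=None):
    """Build a request for the _analyze API; omit `analyzer` to use the default."""
    payload = {"text": text}
    if analyzer is not None:
        payload["analyzer"] = analyzer
    headers = {"Content-Type": "application/json"}
    if auth is not None:  # (user, password) tuple for HTTP basic auth
        token = base64.b64encode(("%s:%s" % auth).encode("utf-8")).decode("ascii")
        headers["Authorization"] = "Basic " + token
    return urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def tokens_from_response(body):
    """Extract the token strings, in order, from an _analyze response body."""
    return [t["token"] for t in json.loads(body)["tokens"]]


def analyze(base_url, text, analyzer=None, auth=None):
    """Send the request and return the token list (needs a running cluster)."""
    request = build_analyze_request(base_url, text, analyzer, auth)
    with urllib.request.urlopen(request) as response:
        return tokens_from_response(response.read().decode("utf-8"))
```

Calling `analyze("http://192.168.1.111:9200", "中华人民共和国国歌", "ik_smart", auth=("elastic", "changeme"))` would mirror the ik_smart request shown above.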

Custom Segmentation Rules

The IK Analysis plugin keeps its dictionary and configuration files in elasticsearch-5.6.8/config/analysis-ik:

-rw-rw----. 1 lipanpan root 5225922 Apr 20 22:15 extra_main.dic
-rw-rw----. 1 lipanpan root   63188 Apr 20 22:15 extra_single_word.dic
-rw-rw----. 1 lipanpan root   63188 Apr 20 22:15 extra_single_word_full.dic
-rw-rw----. 1 lipanpan root   10855 Apr 20 22:15 extra_single_word_low_freq.dic
-rw-rw----. 1 lipanpan root     156 Apr 20 22:15 extra_stopword.dic
-rw-rw----. 1 lipanpan root     625 Apr 20 22:15 IKAnalyzer.cfg.xml
-rw-rw----. 1 lipanpan root 3058510 Apr 20 22:15 main.dic
-rw-rw----. 1 lipanpan root     123 Apr 20 22:15 preposition.dic
-rw-rw----. 1 lipanpan root    1824 Apr 20 22:15 quantifier.dic
-rw-rw----. 1 lipanpan root     164 Apr 20 22:15 stopword.dic
-rw-rw----. 1 lipanpan root     192 Apr 20 22:15 suffix.dic
-rw-rw----. 1 lipanpan root     752 Apr 20 22:15 surname.dic

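Add a custom dictionary:

# Custom dictionaries are kept together under the custom directory here
mkdir custom
vim custom/lpp_word.dic
# Add the following entries, one word per line
王者荣耀
一带一路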
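
When the word list comes from another source rather than being typed by hand, the dictionary file can be generated. The sketch below writes one word per line in UTF-8, dropping blank entries and duplicates. The helper `write_dictionary` is hypothetical, and the path simply reuses the file name from the example above.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line (UTF-8, LF endings), skipping blanks and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    path.parent.mkdir(parents=True, exist_ok=True)
    # str.encode("utf-8") does not emit a byte order mark.
    path.write_bytes(("\n".join(unique) + "\n").encode("utf-8"))
    return unique
```

For instance, `write_dictionary(Path("custom/lpp_word.dic"), ["王者荣耀", "一带一路"])` would recreate the file edited above.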

Update the configuration file:

vim IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict">custom/lpp_word.dic</entry>
    <!-- Configure your own extension stopword dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here: http://xxx.com/xxx.dic -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stopword dictionary here: http://xxx.com/xxx.dic -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
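
Before restarting, it can be worth checking that the file is still well-formed XML and that `ext_dict` points where intended. The sketch below reads the `ext_dict` entry with the standard library. The helper `ext_dict_paths` is hypothetical, and splitting on `;` rests on the assumption that several dictionaries may be listed in one entry separated by semicolons.

```python
import xml.etree.ElementTree as ET


def ext_dict_paths(cfg_xml):
    """Return the dictionary paths listed under the ext_dict entry."""
    root = ET.fromstring(cfg_xml.encode("utf-8"))  # raises ParseError if malformed
    for entry in root.findall("entry"):
        if entry.get("key") == "ext_dict":
            # Assumption: multiple paths are separated by ';'
            return [p.strip() for p in (entry.text or "").split(";") if p.strip()]
    return []


CFG = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="ext_dict">custom/lpp_word.dic</entry>
    <entry key="ext_stopwords"></entry>
</properties>"""

print(ext_dict_paths(CFG))  # prints ['custom/lpp_word.dic']
```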

Restart the Elasticsearch service for the configuration to take effect.

Before:

curl -u elastic:changeme -H 'Content-Type: application/json' -X GET -d '{"analyzer":"ik_max_word", "text":"王者荣耀"}' 'http://192.168.1.111:9200/_analyze?pretty'
{
  "tokens" : [
    {
      "token" : "王者",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "荣耀",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    }
  ]
}

After:

curl -u elastic:changeme -H 'Content-Type: application/json' -X GET -d '{"analyzer":"ik_max_word", "text":"王者荣耀"}' 'http://192.168.1.111:9200/_analyze?pretty'
{
  "tokens" : [
    {
      "token" : "王者荣耀",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "王者",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "荣耀",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    }
  ]
}

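Hot-Updating Segmentation Rules

The plugin supports hot updates of its dictionaries through the remote dictionary entries in IKAnalyzer.cfg.xml:

<!-- Configure a remote extension dictionary here -->
<entry key="remote_ext_dict">location</entry>
<!-- Configure a remote extension stopword dictionary here -->
<entry key="remote_ext_stopwords">location</entry>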

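Here location is a URL, for example http://yoursite.com/getCustomDict. Hot updating works as long as the request meets the following three requirements.

The HTTP response must carry two headers, Last-Modified and ETag. Both are strings, and when either one changes, the plugin fetches the new word list and updates its dictionary;
The response body contains one word per line, with \n as the line separator;
The content is encoded in UTF-8.

When these three requirements are met, dictionaries are hot-updated without restarting the ES instance. One approach is to put the hot words in a UTF-8 encoded .txt file and serve it from nginx or another simple HTTP server. When the .txt file is modified, the HTTP server automatically returns the corresponding Last-Modified and ETag when a client requests the file. A separate tool can extract relevant terms from business systems and update this .txt file.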
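
As an illustration of the three requirements, here is a minimal sketch of such an HTTP endpoint built on Python's standard library instead of nginx. The file name `hot_words.txt`, the port, and all function and class names are assumptions made for this example. It answers HEAD as well as GET, on the assumption that a client may probe the headers before downloading the body.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

DICT_PATH = Path("hot_words.txt")  # hypothetical file: one word per line, UTF-8


def dictionary_response(path):
    """Return (body, headers) for the dictionary file at `path`."""
    body = path.read_bytes()
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        # Last-Modified follows the file's mtime; ETag is a hash of the content.
        "Last-Modified": formatdate(path.stat().st_mtime, usegmt=True),
        "ETag": '"%s"' % hashlib.sha256(body).hexdigest(),
    }
    return body, headers


class DictHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        body, headers = dictionary_response(DICT_PATH)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._send(include_body=True)

    def do_HEAD(self):
        self._send(include_body=False)


def serve(port=8000):
    """Serve the dictionary until interrupted (blocks the calling thread)."""
    HTTPServer(("0.0.0.0", port), DictHandler).serve_forever()
```

With `serve()` running, the `remote_ext_dict` entry would point at this host and port; editing `hot_words.txt` changes the ETag on the next request.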
